ABSTRACT

Many online experiments exhibit dependence between sub- 
jects and items. For example, in online advertising, obser- 
vations that have a user or an ad in common are likely to be 
associated. Because of this, even in experiments involving 
millions of subjects, the difference in means between con- 
trol and treatment outcomes can have substantial variance. 
Previous mathematical and simulation results demonstrate 
that not accounting for this dependence structure can result 
in confidence intervals that are too narrow and inaccurate 
hypothesis tests. 

We examine how bootstrap methods that account for dif- 
fering levels of dependence structure perform in practice. 
We use multiple real datasets describing user behaviors on 
Facebook — responses to ads, search results, and News Feed 
stories — to generate data for experiments in which there 
is no effect of the treatment on average and then estimate 
empirical Type I error rates for each method. Results are 
supplemented with realistic simulations based on the data. 
Accounting for dependence within a single type of unit (i.e. 
within-user dependence) is often sufficient to get reasonable 
error rates. But when experiments have effects, as one might 
expect in the field, accounting for multiple units with a
multiway bootstrap can be necessary to get close to the
advertised Type I error rates. This work provides guidance to
experimenters on calibrating large-scale evaluation systems, 
and highlights the importance of analysis of inferential meth- 
ods under conditions in which experiments have effects. 
statistical inference, online advertising, random effects,
dependent data

1. INTRODUCTION 

Experiments conducted on the Internet, including those 
taking place in social media feeds, online advertising, and 
search, frequently involve millions to tens of billions of ob- 
servations. This could lead to the perception that there is 
little uncertainty about experimental outcomes. However, 
treatment effects are often very small in absolute terms, 
so many observations can be required to distinguish them 
from noise. Furthermore, many Internet-scale datasets have 
a structure such that individual observations are not inde- 
pendent; rather, there is substantial dependence between 
observations of the same units. For example, consider an 
online advertising experiment in which there are 1 million ad
impressions, but these only include 1,000 distinct ads and 
10,000 distinct users. Clearly, the effective sample size may 
be something substantially less than 1 million, and there 
can be substantial uncertainty about the difference in click- 
through rate (CTR) between the treatment and control. 

Accounting for this dependence is important for statis- 
tical inference, including hypothesis testing and confidence 
interval estimation. Inferential procedures that neglect this 
dependence structure are expected to be anti-conservative: 
they will have higher Type I error rates than expected and, 
e.g., "95%" confidence intervals will include the true value 
less than 95% of the time. 

High false positive rates have substantial managerial con- 
sequences. For example, experiments using one popular 
experimentation platform at Facebook compare, on aver- 
age, 3.7 non-control conditions. With four comparisons and 
nominal Type I error rate $\alpha = 0.05$, there should be a
$1 - (1 - 0.05)^4 = 18.5\%$ chance that at least one condition
would be significant under the null hypothesis (i.e., one in
5.4 experiments with no effects may yield at least one
significant condition). But if the true Type I error rate
were considerably higher, say $\alpha = 0.2$, one would have a
$1 - (1 - 0.2)^4 = 59\%$ chance of having falsely rejected the
null hypothesis. Given that many experiments involve com- 
paring multiple outcomes (i.e., metrics), in practice the re- 
sulting effects on decision making can be worse than this 
suggests: not only can there be errors in comparisons of the 
primary outcome, but incorrectly rejecting the null for some 
other outcome might delay or prevent the launch of a change 
that is otherwise beneficial. 
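
The arithmetic above can be checked directly. The short sketch below assumes, as the calculation in the text does, that the comparisons are independent; the function name is ours and purely illustrative.

```python
def familywise_error_rate(alpha: float, comparisons: int) -> float:
    """Chance of at least one false rejection among independent null comparisons."""
    return 1.0 - (1.0 - alpha) ** comparisons


nominal = familywise_error_rate(0.05, 4)   # advertised per-comparison rate
inflated = familywise_error_rate(0.20, 4)  # a miscalibrated per-comparison rate
print(f"nominal: {nominal:.3f}, inflated: {inflated:.3f}")
```

With four comparisons, moving the per-comparison error rate from 0.05 to 0.2 roughly triples the chance of at least one spurious finding.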

This paper describes sources and consequences of dependence
in common applications of experimentation to Internet
services. We posit a general data generating process and
illustrate how experimental assignment procedures and com- 
mon effects of units (e.g., users and ads) affect the true un- 
certainty about experimental comparisons. We then evalu- 
ate independent, one-way, and multiway bootstrap methods 
for computing confidence intervals using null experiments 
("A/A tests") on three real datasets from Facebook: clicks 
on advertisements, search results, and content in the News 
Feed. We additionally modify these datasets to simulate 
systematic imbalance in items across conditions, as would 
result from changes to CTR prediction or ranking models. 
To examine performance under additional deviations from 
treatment effects, we conduct simulations using a realistic 
probit random effects model. 

Our primary contribution is providing guidance about 
when accounting for dependence among observations is 
most important: while previous work has shown that ne- 
glecting all dependence structure results in massive overcon- 
fidence, less work has examined how accounting for some 
sources of dependence, but not others, affects inference in 
practice. We conclude that analysts should certainly use an
inferential procedure that accounts for dependence among
observations of the units assigned to conditions (e.g., users),
but that whether failing to additionally account for secondary
units (e.g., ads, search results, links) makes for misleading
inference is more likely to depend on the specific (usually
partially unknown) deviations from the sharp null hypothesis
that the experiment has no effects.

The literature on routine Internet-scale experimentation 
(e.g. [8] [14]) largely does not address such questions about 
statistical inference. Some authors [14] suggest conducting 
null experiments to evaluate one's experimentation tools, 
but little is said about how these should be conducted and 
exactly what problems they can detect. We intend that, in 
addition to our results, this paper provides a blueprint for 
other experimenters who wish to evaluate and choose among 
inferential procedures in their own settings. 

2. DEPENDENCE IN EXPERIMENTS 

Many experiments allow observing the same units repeat- 
edly. We may observe responses from the same person many 
times and also observe responses to the same items many 
times. In this section, we examine how this affects our esti- 
mation of contrasts between experimental conditions, such 
as differences in means between treatment and control, i.e., 
the average treatment effect (ATE).¹

Recent work in applied econometrics has been concerned 
with dependence due to clustering in data. It is now routine 
for work in empirical economics to consider and account for 
dependence in observations produced by one or more types 
of units. This is reflected in the fact that a recent paper by 
Cameron et al. [6] on dealing with dependence due to ob- 
serving two or more types of units repeatedly has been cited 
over 550 times.² Concerns about such dependence have been
featured centrally in methodological work in the context of a
growing number of field experiments in economics and other
social sciences [12]. Similarly, work on two-way and tensor
data in the context of recommender systems and observational
comparisons has emphasized the importance of accounting for
multiway dependency [18, 19]. And in psychometrics [?] and
psycholinguistics [2], investigators have identified problems
with ignoring either of two sources of dependence.

1 There are likely other sources of dependence among
observations in online experiments, including some arising from
general equilibria in advertising auctions, peer effects, and
other "spillovers". In this paper, we restrict our attention to
dependence due to repeated observations of the same units,
for which we have inferential procedures, while these other
sources of dependence take us into active areas of research
beyond the scope of this paper.
2 Citation count according to Google Scholar [2013-02-22].

For practitioners like us who conduct and analyze massive
Internet experiments, the degree of attention given to this area
suggests a need to consider the consequences of dependence 
for our data. We present our effort to understand whether it 
would be necessary, in order to have inferential procedures 
with good performance, to account for multiple units caus- 
ing dependence in our data, or whether a single unit would 
suffice. 

2.1 Random effects model 

We use the random effects model to illustrate how depen- 
dence can affect uncertainty in ATEs, and motivate the use 
of the bootstrap. Random effects models provide a natural 
and general way to describe outcomes for data generated by 
combinations of units, in which each unit and each combi- 
nation of units contributes a random effect. In the two-way 
crossed random effects model [2, 24], each observation is
generated by some function $f$ of a linear combination of a
grand mean, $\mu$, a random effect $\alpha_i$ for the first
variable, which (without loss of generality) we take to be the
idiosyncratic deviation for user i, and a second random variable
$\beta_j$ for the idiosyncratic deviation for item j (e.g., an
ad, a search result, a URL). Finally, we have an error term
$\varepsilon_{ij}$ for each user's idiosyncratic response to
each item.³ This final term could be caused by a number of
factors, including how relevant the item is to the user. Thus,
we have the model

$$Y_{ij} = f(\mu + \alpha_i + \beta_j + \varepsilon_{ij})$$

$$\alpha_i \sim \mathcal{H}(0, \sigma^2_{\alpha_i}), \quad \beta_j \sim \mathcal{H}(0, \sigma^2_{\beta_j}), \quad \varepsilon_{ij} \sim \mathcal{H}(0, \sigma^2_{\varepsilon_{ij}}).$$

Each random effect is modeled as being drawn from some
distribution $\mathcal{H}$ with zero mean and some variance. In
the homogeneous random effects model, this variance is the same
for each user or item (i.e., $\sigma_{\alpha_i} = \sigma_\alpha$),
whereas in a heterogeneous random effects model, each variable
or group of variables may have its own variance.
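
As an illustration, the following minimal sketch draws data from a homogeneous two-way crossed random effects model with $f$ the identity and normal random effects. The function name, the parameter defaults, and the choice to sample user-item pairs uniformly without replacement are assumptions made for illustration, not part of the model above.

```python
import numpy as np


def simulate_crossed(n_users, n_items, n_obs, mu=0.0,
                     sd_user=1.0, sd_item=1.0, sd_err=1.0, seed=0):
    """Simulate Y_ij = mu + alpha_i + beta_j + eps_ij for n_obs distinct pairs."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(0.0, sd_user, n_users)   # user random effects
    beta = rng.normal(0.0, sd_item, n_items)    # item random effects
    # Sample distinct (user, item) pairs so each pair is observed once.
    pair_ids = rng.choice(n_users * n_items, size=n_obs, replace=False)
    user, item = np.divmod(pair_ids, n_items)
    eps = rng.normal(0.0, sd_err, n_obs)        # pair-level error term
    y = mu + alpha[user] + beta[item] + eps     # f is the identity here
    return user, item, y


user, item, y = simulate_crossed(n_users=200, n_items=50, n_obs=2000, seed=1)
```

Observations that share a user or an item share the corresponding random effect, which is the source of the dependence discussed in this section.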

2.1.1 Comparisons of means 

We extend the basic random effects model above to consider
multiple experimental conditions and develop expressions for
the variance of a difference in means between experimental
conditions. This illustrates how repeatedly observing the same
units, and which units are randomly assigned to conditions, 
affects this variance. 

Let, without loss of generality, users (rather than items)
be assigned to experimental conditions. That is, all
observations of the same user have the same experimental
condition, such that $D_{ij} = D_{ij'}$ for all $j \neq j'$. As in
other work on random effects models where we observe only a
small number of combinations of units [18, 19], we work
conditional on D.



3 For expository simplicity, we consider only a single
observation of each user-item pair. An additional error term
can be included when there are repeated observations of pairs.



For the sake of exposition, we restrict our attention to
linear models with normally distributed random effects. That
is, the following analysis considers cases where Y is
unbounded, $f$ is the identity function, and random effects are
drawn from a multivariate normal distribution, so that

$$Y_{ij}^{(d)} = \mu^{(d)} + \alpha_i^{(d)} + \beta_j^{(d)} + \varepsilon_{ij}^{(d)}$$

$$\alpha_i \sim \mathcal{N}(0, \Sigma_\alpha), \quad \beta_j \sim \mathcal{N}(0, \Sigma_\beta), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \Sigma_\varepsilon). \qquad (1)$$

Note that $\alpha_i$, etc., are vectors, where each element
corresponds to the random effect of a unit under a given
treatment.

We wish to estimate quantities comparing outcomes for
different values of $D_{ij}$, most simply the difference in
means, or average treatment effect (ATE), for a binary
treatment, which we denote $\delta$.

While this difference cannot be directly observed from the
data (since user-item pairs can only be assigned to one
condition at a time), we can estimate $\delta$ with the
difference in sample means⁴ [21]. Our focus is then to consider
the true variance of this estimator of $\delta$ and, later,
bootstrap methods for estimating that variance.

The sample mean for each condition is

where, e.g., $n_{i\bullet}^{(d)}$ is the number of observations
of user i in condition d. We then estimate the ATE with
$\hat{\delta} = \bar{Y}^{(1)} - \bar{Y}^{(0)}$. Consider the case
where the treatment and control groups are of equal size such
that $N = n_{\bullet\bullet}^{(0)} = n_{\bullet\bullet}^{(1)}$. This
enables simplifying the expression for $\hat{\delta}$ and its
variance to

The first term is the contribution of random effects of users
to the variance, and the second is the contribution of the
random effects of items. The covariance term, present for
items, is absent for users and user-item pairs since each is
only observed in either the treatment or control.

To further simplify, we can introduce coefficients measuring
how much units are duplicated in the data. Following previous
work [18, 19], we define

$$\nu_A^{(d)} \equiv \frac{1}{N} \sum_i \left(n_{i\bullet}^{(d)}\right)^2, \qquad \nu_B^{(d)} \equiv \frac{1}{N} \sum_j \left(n_{\bullet j}^{(d)}\right)^2,$$

which are the average number of observations sharing the same
user (the $\nu_A$s) or item (the $\nu_B$s) as an observation
(including itself). For the units assigned to conditions (in
this case, users), either $n_{i\bullet}^{(0)}$ or
$n_{i\bullet}^{(1)}$ is zero for each i; for the non-assigned
units (items), we need a measure of this between-condition
duplication

Under the homogeneous random effects model, we can then
simplify this variance expression to

This expression makes clear that if the random effects for
items in the treatment and control are correlated (as we would
usually expect), then an increase in the balance of how often
items appear in each condition reduces the variance of the
estimated treatment effect.

4 For true experiments, D is randomly assigned, but under
some circumstances (i.e., conditional ignorability) treatment
effects may be estimated without randomization [?, 17, 21].
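
The duplication coefficients can be computed from a table of observations with a few counts. The sketch below is illustrative only: the function name and the toy data are ours, it requires equal-sized conditions as in the text, and the form used for the between-condition duplication, $\omega_B = N^{-1} \sum_j n_{\bullet j}^{(0)} n_{\bullet j}^{(1)}$, is an assumption consistent with the description above rather than a formula taken from it.

```python
import numpy as np


def duplication_coefficients(user, item, treated):
    """Return nu_A and nu_B per condition, plus omega_B, for equal-sized conditions."""
    user, item, treated = (np.asarray(a) for a in (user, item, treated))
    n_users, n_items = user.max() + 1, item.max() + 1
    sizes = [int(np.sum(treated == d)) for d in (0, 1)]
    if sizes[0] != sizes[1]:
        raise ValueError("this sketch assumes equal-sized conditions")
    N = sizes[0]
    coef, item_counts = {}, {}
    for d in (0, 1):
        rows = treated == d
        n_i = np.bincount(user[rows], minlength=n_users)  # n_{i.}^{(d)}
        n_j = np.bincount(item[rows], minlength=n_items)  # n_{.j}^{(d)}
        coef[f"nu_A_{d}"] = float(np.sum(n_i ** 2) / N)
        coef[f"nu_B_{d}"] = float(np.sum(n_j ** 2) / N)
        item_counts[d] = n_j
    # Assumed form of the between-condition duplication for items.
    coef["omega_B"] = float(np.sum(item_counts[0] * item_counts[1]) / N)
    return coef


# Toy data: users 0 and 1 are in control, user 2 is treated.
user = [0, 0, 1, 2, 2, 2]
item = [0, 1, 0, 0, 1, 2]
treated = [0, 0, 0, 1, 1, 1]
coef = duplication_coefficients(user, item, treated)
```

In the toy data the single treated user accounts for all three treated observations, so the user duplication in that condition equals 3, while items are spread evenly there and have duplication 1.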

2.1.2 Sharp and non-sharp null hypotheses

Under the sharp null hypothesis, the treatment has no average
or interaction effects; that is, the outcome for a particular
user or item is the same regardless of treatment assignment.
In the context of our model, this would mean that the
variances of the random effects are equal in both conditions
and are perfectly correlated across conditions, such that, in
addition to $\delta = 0$,

In this case, only random effects for items that are not
balanced across conditions contribute to the variance of our
ATE estimate: the contribution a single item j makes to the
variance simplifies to
$(n_{\bullet j}^{(1)} - n_{\bullet j}^{(0)})^2 \sigma_\beta^2$;
that is, it depends only on the squared difference in
duplication between treatment and control. It is easy to show
that

between-condition duplication of observations of items. If
items, like users, also only appear in either treatment or
control, then $\kappa_B = \nu_B^{(0)} + \nu_B^{(1)}$, highlighting
the resulting symmetry between users' and items' contributions
to our uncertainty.
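
To make this simplification concrete, the following worked step assumes that item j enters the variance through its two condition-specific variances and a cross-condition covariance, as described for the item term in Section 2.1.1. The exact form of that term, and the symbol $\omega_B = N^{-1} \sum_j n_{\bullet j}^{(0)} n_{\bullet j}^{(1)}$ used afterwards, are our assumptions. Setting both variances and the covariance to $\sigma_\beta^2$, as equal variances and perfect correlation imply, gives

```latex
\begin{align*}
&\left(n_{\bullet j}^{(1)}\right)^2 \sigma_\beta^2
 + \left(n_{\bullet j}^{(0)}\right)^2 \sigma_\beta^2
 - 2\, n_{\bullet j}^{(0)} n_{\bullet j}^{(1)} \sigma_\beta^2 \\
&\qquad = \left(n_{\bullet j}^{(1)} - n_{\bullet j}^{(0)}\right)^2 \sigma_\beta^2 .
\end{align*}
```

Summing over items and dividing by N then gives $(\nu_B^{(0)} + \nu_B^{(1)} - 2\omega_B)\,\sigma_\beta^2$ under these assumptions, which reduces to $(\nu_B^{(0)} + \nu_B^{(1)})\,\sigma_\beta^2$ when no item appears in both conditions, so that $\omega_B = 0$.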

When this condition does not hold, we say that there are interaction
effects of the treatment and units; for example, there may 
be an item-treatment interaction effect. 



In addition to deviations from the sharp null due to a 
non-zero average effect, many experiments in domains like 
search, ads, and recommender systems can result in imbal- 
ance and item-treatment interaction effects. For example, 
a new recommendation model may show different items to 
users and present items in more (and less) prominent
positions. Compared with a null treatment, these changes would
produce a smaller $\omega_B$ and deviations from the sharp-null
condition above, including a lack of perfect covariance of
treatment and control random effects. We can conceive of other
treatments that do not change which items are observed, but
make some items more likely to produce a response; this would
correspond to deviations from that condition only.

Together these considerations highlight that we need to 
evaluate tests and confidence intervals under conditions 
other than the sharp null hypothesis, since the variance of 
our estimated difference can be substantially larger under 
other more realistic circumstances in which imbalance and 
interaction effects exist. 

2.1.3 Choice of experimental unit 

We have so far taken it as given that users are the units
assigned to conditions, but under the random effects model, it
is clear that other choices, when possible, can increase
precision. More generally, our variance expressions highlight
that which units are assigned to conditions determines which
units can be expected to contribute most of the uncertainty to
our estimates of treatment-control comparisons. This creates an
asymmetry in two-way data that is not present in prior work on
such dependence [6, 18, 19].

It is common in the design of industrial, agricultural, and
psychological experiments to carefully consider such assignment
schemes, including using between-subjects, within-subjects,
mixed designs, and blocking to reduce variance [?] while
meeting constraints caused by, e.g., carryover effects.

2.2 Bootstrapping dependent data

The bootstrap [10] offers a very general method for
characterizing the sampling distribution of a statistic (e.g., a
difference in means), and can be used to produce confidence
intervals for experimental comparisons for many different
data generating processes. The bootstrap distribution of a
sample statistic is the distribution of that statistic under
resampling [10] or reweighting [22] of the sample. In this
section, we describe how the bootstrap can be applied to
dependent data. We focus on a version of the bootstrap
that uses independent weights, rather than the resampling
bootstrap, since it is suitable for use in online (i.e.,
streaming) computational settings [19, 20].

2.2.1 The iid bootstrap

In order to get a confidence interval for some statistic $t$,
we produce R replicates of the statistic, $t^*$, computed on
randomly reweighted versions of the sample. That is, for some
replicate $r \in [1, R]$, each observation $Y_{ij}$ is randomly
reweighted with weights $W_{ij}$. These reweighted samples
allow us to estimate features of the sampling distribution of
our statistic. We generally have $W_{ij} \sim G$, where $G$ is
some distribution with mean and variance 1, such as Poisson(1)
and Uniform{0, 2} [19, §3.3]. Note that in this bootstrap, each
individual observation is reweighted independently of other
observations, including other observations of the same units.
Applied to two-way data, the iid bootstrap can be expected to
underestimate the variance of statistics and thus produce
confidence intervals with poor coverage [15].



2.2.2 Single-way bootstrap 

In the single-way bootstrap, or "block" or "cluster" boot- 
strap, the analyst chooses a single relevant type of unit 
(e.g., users) and all observations from the same unit are 
given the same random weight when reweighting. In other 
words, taking i as indexing the chosen type of unit, we have
$W_{ij} = W_{ij'} = u_i$ and $u_i \sim G$ for all $j, j'$. When the data only
has one-way dependency, this procedure produces a boot- 
strap distribution that gives consistent confidence intervals. 
When the data has additional dependency structure, it can 
be anticonservative; we use real data and simulations to ex- 
amine how poorly it works in practice. 
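
A minimal sketch of the single-way reweighting bootstrap for a difference in means follows. It assumes Poisson(1) weights and integer unit identifiers; the function name and the synthetic data are ours. Passing a distinct identifier for every observation recovers the iid bootstrap, which makes the two easy to compare.

```python
import numpy as np


def reweighted_diff_in_means(y, unit, treated, R=200, seed=0):
    """Bootstrap replicates of the difference in means, one weight per unit."""
    y, unit, treated = (np.asarray(a) for a in (y, unit, treated))
    rng = np.random.default_rng(seed)
    is_t = treated == 1
    reps = np.empty(R)
    for r in range(R):
        u = rng.poisson(1.0, unit.max() + 1)  # one Poisson(1) weight per unit
        w = u[unit]                           # shared by all of the unit's rows
        reps[r] = (np.sum(w[is_t] * y[is_t]) / np.sum(w[is_t])
                   - np.sum(w[~is_t] * y[~is_t]) / np.sum(w[~is_t]))
    return reps


# Synthetic example: 200 users with 20 observations each and strong user effects.
rng = np.random.default_rng(1)
user = np.repeat(np.arange(200), 20)
y = rng.normal(0, 1, 200)[user] + rng.normal(0, 1, user.size)
treated = user % 2  # users, not observations, are assigned to conditions

by_user = reweighted_diff_in_means(y, user, treated)
by_obs = reweighted_diff_in_means(y, np.arange(y.size), treated)  # iid bootstrap
print(by_user.std(), by_obs.std())
```

On data like these, the spread of the user-level replicates is several times that of the iid replicates, which is the overconfidence of the iid bootstrap in miniature.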

2.2.3 Multiway bootstrap 

When there are two or more relevant units, analysts can 
use a bootstrap that reweights all relevant units. Under a 
more general random effects model than the one presented 
above, the multiway bootstrap produces variance estimates, 
and thus confidence intervals, that are mildly conservative
[19].⁵ The two-way bootstrap has been used for analyzing large
online advertising experiments [13].

With two-way data, we have $W_{ij} = u_i v_j$, where
$u_i \sim G$ and $v_j \sim G$. That is, the random weight for an
observation is the product of two independently sampled weights
assigned to unit i and unit j. For example, if in one
replicate, user i gets weight 2 and item j gets weight 3, then
all observations of the pair (i, j) get weight 2 × 3 = 6 in
that replicate. Note that if either unit has a weight of 0, any
combination of that unit with another unit will be given a
weight of 0. This procedure can be generalized to cover d-way
data in a straightforward fashion [19].
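
The product weighting can be sketched in a few lines. The function names are ours, and Poisson(1) is assumed for the weight distribution in the random draw; the first call reproduces the worked example from the text with fixed weights.

```python
import numpy as np


def two_way_weights(u, v, user, item):
    """W_ij = u_i * v_j for each observed (user, item) row."""
    return np.asarray(u)[np.asarray(user)] * np.asarray(v)[np.asarray(item)]


def draw_two_way_weights(user, item, rng):
    """Draw independent Poisson(1) weights per user and per item, then multiply."""
    user, item = np.asarray(user), np.asarray(item)
    u = rng.poisson(1.0, user.max() + 1)
    v = rng.poisson(1.0, item.max() + 1)
    return two_way_weights(u, v, user, item)


# User 0 has weight 2 and user 1 has weight 0; item 0 has weight 3, item 1 has weight 1.
w = two_way_weights(u=[2, 0], v=[3, 1], user=[0, 0, 1], item=[0, 1, 0])
w_random = draw_two_way_weights([0, 0, 1], [0, 1, 0], np.random.default_rng(0))
```

The pair (user 0, item 0) receives weight 6, and every observation of user 1 receives weight 0 regardless of the item.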

2.2.4 Online bootstrapping 

For any statistic $t$ that can be computed online, the
single-way bootstrap can be implemented online as follows
[19, 20]. On visiting each observation, use a hash of an
identifier of each unit (e.g., a user ID) as the seed to the
random number generator for $G$, draw R weights (one for each
of the bootstrap replicates), and use these weights to update
the running sufficient statistics for $t^*$. The multiway bootstrap
can be implemented online by using the same procedure as 
for the single-way bootstrap, but at each observation draw- 
ing R weights for each of its d units and computing their 
products. 
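
The online procedure can be sketched as follows for a weighted mean. The choice of SHA-256 as the hash, the "user:"/"item:" identifier prefixes, Poisson(1) weights, and the class and function names are all assumptions made for illustration; the text specifies only that a hash of the unit identifier seeds the generator.

```python
import hashlib

import numpy as np


def unit_weights(unit_id: str, R: int) -> np.ndarray:
    """Deterministic Poisson(1) weights for one unit, seeded by a hash of its ID."""
    digest = hashlib.sha256(unit_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    return np.random.default_rng(seed).poisson(1.0, R)


class OnlineBootstrapMean:
    """Running sufficient statistics for R reweighted means."""

    def __init__(self, R: int):
        self.R = R
        self.sum_w = np.zeros(R)
        self.sum_wy = np.zeros(R)

    def update(self, y: float, unit_ids) -> None:
        # One ID gives the single-way bootstrap; several IDs give the
        # multiway bootstrap through the product of their weights.
        w = np.ones(self.R)
        for unit_id in unit_ids:
            w = w * unit_weights(unit_id, self.R)
        self.sum_w += w
        self.sum_wy += w * y

    def replicates(self) -> np.ndarray:
        return self.sum_wy / self.sum_w


acc = OnlineBootstrapMean(R=4)
stream = [(1.0, "user:1", "item:a"), (0.0, "user:1", "item:b"), (1.0, "user:2", "item:a")]
for y, user_id, item_id in stream:
    acc.update(y, [user_id, item_id])
```

Because the weights are a deterministic function of the identifier, the same unit receives the same weights wherever its observations are processed, so no per-unit state has to be stored or shared across machines.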

2.3 Alternative methods 

Bootstrap methods are attractive because they involve 
minimal assumptions and scale well to large datasets. There 
are other methods commonly used in practice for statistical 
inference with dependent data. One could fit a random ef- 
fects model to the data and then use likelihood-ratio tests 
or Bayesian inference [5] for the treatment effect parame- 
ters of interest. Random effects models require that the 
experimenter is able to specify a generative model in ad- 
vance, and require the analyst to make certain assumptions 
(e.g., homogeneous variances and normality) that are not
needed for the bootstrap. Fitting very large crossed random
effects models also presents computational difficulties,
especially with datasets that span many nodes in a distributed
environment.

5 Previous work in education research [5] and statistics
examined the two-way bootstrap for balanced data.

Figure 1: An illustration of our method for computing true coverage rates for the bootstrap methods with the
Search dataset. We compute 500 A/A tests to obtain nominal "95% confidence intervals" for the difference in
means $\delta_{k,r}$, and count the fraction of tests that accept the null hypothesis (i.e., indicate there is no significant
difference in means). To show how results can vary between comparisons, we sort the results by $\mathbb{E}_r[\delta_{k,r}]$,
and highlight results that (incorrectly) reject the null. Anti-conservative tests (in this case, the iid and
item-clustered bootstrap) reject in more than 5% of the experiments. Differences in the figure are shown
relative to the grand mean.

Recently there has been widespread adoption of cluster-robust
Huber-White "sandwich" standard errors within econometrics,
including extensions to two-way and multiway dependence [6].
These methods are asymptotically consistent for a large class
of M-estimators, but results for these methods are not
available for many statistics of interest in large
online experiments, such as trimmed and Winsorized means. 
As with fitting random effects models, sandwich standard 
errors pose computational difficulties; they require multiple 
passes through the data and collecting all observations that 
share a unit, which is not necessary for the bootstrap. 

3. EMPIRICAL EVALUATION 

We evaluate each bootstrap method under random permu- 
tations and modifications of the datasets that correspond to 
various versions of a null hypothesis of no average effect of an 
experimental treatment on the primary outcome of interest. 
First, under the sharp null hypothesis, the treatment has no 
effects at all, both on the outcome and on which combinations
of units are observed. Given our three real datasets,
we can produce data consistent with the sharp null simply 
by randomly assigning units to treatment conditions. Many 
authors [8, 13, 14] stress the importance of conducting such
"A/A tests" as a validation of the combination of one's ran- 
dom assignment and statistical inference procedures, though 
it is generally not stated exactly how these null experiments 
should be carried out or what their limitations are. 

3.1 Data 

We examine click-through rate outcomes for three core 
product areas: ads, search, and News Feed. Due to the sen- 
sitivity of the data, we only focus on one category of items 
for each dataset when reporting results from our computational
experiments. For example, while there are many different types
of items that show up in search results, such as friends, apps,
groups, pages, Web results, etc., the results we present only
apply to one of these item types.

Ads. We analyze ad click-through rates for one type of ad 
unit for a popular advertising product on Facebook. Each 
impression corresponds to a single delivery of the ad to a 
user's Web browser. 

Search. We analyze search click-through rates for one 
type of search result on Facebook. Each impression is a val- 
idated delivery of an item in the "typeahead" results, and 
each click is a click on the item. Note that if an item is
presented multiple times over several query reformulations,
each is considered a separate impression.

Feed. We analyze click-through rates for one type of story 
in the News Feed in a large country. Each impression cor- 
responds to a single delivery of the story to a viewer's Web 
browser, and a click corresponds to a click on the item's 
thumbnail or snippet. 

3.2 Computation 

To compute the A/A tests, we first partition the data into
M segments based on the unit we wish to randomize over
(i.e., the user ID) such that each segment contains all
observations with that corresponding identifier. We segment
the data by taking the identifier of the unit we wish to
randomize, concatenating it with a salt (i.e., an integer),
computing this string's MD5 hash value,⁶ and assigning the
unit to a segment number that is the integer representation of
the first 7 digits of the hashed value modulo M. In our
experiments, we compute the bootstrapped difference in means
between every even-numbered segment m and m + 1, yielding
50 comparisons per salt, and repeat this procedure for
10 salts, yielding K = 500 null experiment comparisons for
each method (Figure 1).
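
The segmentation step can be sketched as below. Several details are assumptions on our part: that the identifier and salt are concatenated directly with the identifier first, that "digits" means hexadecimal digits of the MD5 digest, and that M = 100 (which is what 50 comparisons per salt implies). The function name is ours.

```python
import hashlib


def segment(unit_id: str, salt: int, M: int = 100) -> int:
    """Map a unit to one of M segments from the MD5 hash of its salted ID."""
    digest = hashlib.md5(f"{unit_id}{salt}".encode("utf-8")).hexdigest()
    return int(digest[:7], 16) % M  # first 7 hex digits, modulo M


# Every even-numbered segment m is compared with segment m + 1.
pairs = [(m, m + 1) for m in range(0, 100, 2)]
example = segment("12345", salt=3)
```

Changing the salt reshuffles users across segments, so each salt yields a fresh set of null comparisons from the same data.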



6 Although MD5 is not cryptographically safe (e.g. similar 
inputs may have correlated outputs) , in practice we find that 
MD5 yields similar results with greater computational effi- 
ciency compared to cryptographically safe hashing functions 
like SHA-1. 
Figure 2: True coverage for nominal 95% confidence intervals produced by the iid, single-way, and multiway
bootstrap for A/A tests segmented by user ID as a function of time. Uncertainty estimates for the iid and
item-level bootstrap become increasingly inaccurate over time, while the user-level and multiway bootstrap
have the advertised or conservative Type I error rate.

Table 1: The amount of duplication present in our
datasets for a single 1% segment of users.



The confidence intervals for each method for each null 
experiment result from R = 500 bootstrap reweightings of 
the data. We augment the identifiers of the data using the
corresponding 10 salts from the null experiments and apply
the bootstrap procedures described in Section 2.2. To
determine whether or not a bootstrap experiment is
significant, we compute the mean and variance of the
difference in means $\delta_{kmr}$ over all R replicates. The
distributions of $\delta_{kmr}$ are asymptotically normal
under the bootstrap, so we simply use quantiles of the normal
to compute the central $100(1 - \alpha)\%$ interval.
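
A minimal sketch of this step, with function names of our choosing, takes the R bootstrap replicates of the difference in means and forms the normal-approximation interval from their mean and standard deviation:

```python
from statistics import NormalDist

import numpy as np


def normal_interval(reps, alpha=0.05):
    """Central 100(1 - alpha)% interval from bootstrap replicates."""
    reps = np.asarray(reps, dtype=float)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    center, se = reps.mean(), reps.std(ddof=1)
    return center - z * se, center + z * se


def rejects_null(reps, alpha=0.05):
    """True if the interval for the difference in means excludes zero."""
    lo, hi = normal_interval(reps, alpha)
    return not (lo <= 0.0 <= hi)


lo, hi = normal_interval([-1.0, 0.0, 1.0])
```

Using the replicate mean and variance rather than empirical percentiles keeps the required sufficient statistics small, which suits the online setting of Section 2.2.4.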

To obtain the estimated true coverage under the sharp null
hypothesis (zero mean difference, equal variance), we compute
the proportion of the K bootstrap tests that indicate a
significant difference in means at some level $\alpha$; the
coverage is one minus this empirical rejection rate. We treat
each of the K comparisons as independent, and use the Wilson
score interval for binomial proportions [1] to estimate the
uncertainty around the coverage.
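
For reference, the Wilson score interval can be computed as in the sketch below; the function name and the example counts are illustrative and not results from our data.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, conf: float = 0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical example: 475 of K = 500 intervals cover zero.
lo, hi = wilson_interval(475, 500)
```

Unlike the simple normal approximation, this interval stays within [0, 1] and behaves sensibly when the observed proportion is close to 0 or 1, as coverage rates often are.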

We may also obtain the coverage for a non-sharp null hy- 
pothesis by creating synthetic imbalance between the items 
in both conditions. To do this, for each pair of segments 
(m,m + 1), we downsample each item from either segment 
m or m + 1 (chosen with equal probability); in the down- 
sampled segment for some item j, its user-item pairs are 
(independently) removed with probability p. Thus, when 
p = 0, we have the sharp null hypothesis, and when p = 1,
we have total imbalance (i.e., the two conditions contain dis- 
joint sets of items). 
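
The downsampling procedure can be sketched as follows, assuming one row per user-item pair with integer item identifiers and segment numbers; the function name and the toy data are ours.

```python
import numpy as np


def downsample_items(item, segment, m, p, seed=0):
    """Boolean mask of rows kept for the comparison of segments m and m + 1."""
    item, segment = np.asarray(item), np.asarray(segment)
    rng = np.random.default_rng(seed)
    in_pair = (segment == m) | (segment == m + 1)
    # For each item, pick with equal probability which segment is thinned.
    thinned_segment = m + rng.integers(0, 2, item.max() + 1)
    candidate = in_pair & (segment == thinned_segment[item])
    # Rows in the thinned segment are removed independently with probability p.
    dropped = candidate & (rng.random(item.size) < p)
    return in_pair & ~dropped


# Toy data: 5 items, each appearing twice in segment 0 and twice in segment 1.
item = np.tile(np.arange(5), 4)
segment = np.repeat([0, 1], 10)
keep_none_dropped = downsample_items(item, segment, m=0, p=0.0)
keep_disjoint = downsample_items(item, segment, m=0, p=1.0)
```

With p = 1 every item survives in exactly one of the two segments, so the two conditions share no items; with p = 0 the data are unchanged.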

3.3 Duplication 

A central quantity that contributes to the variance of
$\hat{\delta}$ is the average number of observations that
share the same user, $\nu_{\text{user}}$, and item,
$\nu_{\text{item}}$. We give basic summary statistics about the
duplication in the data for a random 1% segment used in our
evaluation for the Ads, Search, and Feed datasets in Table 1.
For the restricted categories of items we consider in each
dataset, there are more users exposed to ads than to the
search results or feed stories. While per-user duplication is
similar across the three datasets, the per-item duplication
for Ads is much higher than for either Search or Feed. This
pattern is congruent with the nature of the items: the number
of businesses that are actively advertising is far smaller
than the number of users, while search results and News Feed
stories tend to have a much longer tail of items, which
results in lower duplication.

Figure 3: Duplication ($\nu$) for users and items over
time relative to the first day.

Experiments often run for many days; as the number of
days increases, so does the duplication. Figure 3 shows how
duplication increases over time. With the exception of Feed
items, this relationship is roughly linear for both users and
items. The behavior for Feed may be explained by the way
social media feeds work: unlike ads and search results, users 
see and interact with very recent content, therefore limiting 
the average number of users that may be exposed to an item. 

Given the two-way random effects model and the increasing
relationship between the duplication coefficients and time,
we expect that users and items may contribute substantially
to the variance of $\hat{\delta}$. Not taking these units into
account when computing confidence intervals may result in
poor coverage. Figure 2 shows the true coverage of the
different bootstrap methods for consecutively larger spans of
time in each dataset. We find that the iid confidence inter- 
vals tend to be highly anti-conservative. For example, after 
two weeks of data collection, a search experiment that tests 
the difference in click-through rates between two equivalent 
groups of users could result in rejecting the null hypoth- 
esis nearly 50% of the time. We find that bootstrapping 
by the unit not being randomized over (the item) often 
leads to anti-conservative intervals, and that for the sharp 
null with little imbalance in items, the user-level bootstrap 
yields accurate coverage. The multiway bootstrap on the 
other hand remains conservative no matter how many days 
are considered. 

3.4 Imbalance in items 

Given how these A/A tests were constructed, there is ap- 
proximate balance of items across conditions, such that the 
primary contributors to the variance of $\hat{\delta}$ are the user and
residual error components. However, if items are system- 
atically imbalanced across treatments (e.g. the experiment 
results in showing similarly relevant, but different ads), then 
item random effects can also make a substantial contribu- 
tion (Equation pj . To examine how such imbalance might 
affect the coverage of the confidence intervals in practice 
when the treatment has no average or interaction effects, 
we created imbalance by downsampling items from either 
condition with probability p (see Section [3.2] for details).

Figure 4: True coverage for nominal 95% confidence
intervals for each bootstrap method applied to data
with varying levels of synthetic imbalance of items
across conditions for 2 weeks of data. Imbalance
does not appear to affect the accuracy of the true
coverage for the multiway and user-level bootstrap,
while the iid and item-level bootstrap become more
conservative when imbalance is greatest.

Figure [4] shows the true coverage with varying censoring 
probabilities, $p \in \{0.3, 0.6, 1.0\}$. Despite the threat that the
imbalance might result in a large item-level contribution to 
the variance, the coverage of the user bootstrap, which ne- 
glects this variance, remains approximately as advertised. 
This result may be due to a number of factors. First, the 
most straightforward expressions for $V[\hat{\delta}]$ and the expected
variance estimates from the bootstrap procedures involve
assuming a homogeneous random effects model, when it can
actually be expected that the variances of the random
effects for, e.g., frequently observed users differ from
those for infrequently observed users. Second, there is a
relatively high amount of index-level duplication in the data,
such that for many users and items only a small number
of user-item pairs are observed; such duplication can cause the
multiway bootstrap to be very conservative [19, Theorem 7].
For Feed and Search, the coverage of the iid and item
bootstrap confidence intervals increases notably, though
it remains poor and the intervals continue to undercover.
This is expected since, in addition to creating imbalance,
the downsampling procedure reduces within-condition duplication.

4. SIMULATIONS 

We have seen how different bootstrap methods perform 
under the sharp null hypothesis and synthetic imbalance 
with three real-world domains. However, these A/A tests 
cannot tell us about how bootstrap procedures might per- 
form in situations where treatments do have effects. For 
example, an ads experiment that manipulates the display of 
certain advertising units may only affect certain items and
not others. To explore these circumstances, we conduct
simulations with a probit random effects model parameter- 
ized to mirror the kinds of outcomes described in the previ- 
ous section. We use this generative model to vary the pres- 
ence of an item-treatment interaction, a plausible source of 
violations of the sharp null hypothesis. 

We modify the model of |T| so that Y is binary and there
is a single intercept $\mu$ common to both treatment and control,
reflecting the lack of an ATE.

Also reflecting the absence of an ATE, we restrict the
random effect variance to be the same in treatment and control.
For example, the item random effects have the same variance
$\sigma_\beta^2$ in both conditions, and an item's treatment and
control effects have correlation $\rho_\beta$.

To make realistic choices for the variances of the random
effects, we fit probit random effects models to the ads
dataset from a large random sample of users in each of
several small countries. This produced several estimates of
$\sigma_\alpha$ and $\sigma_\beta$. We report on simulation results for
$\sigma_\alpha = 0.3$, which is close to several of the estimates. Our
estimates of $\sigma_\beta$ often ranged from 0.2 to 0.9, so we present
results for $\sigma_\beta \in \{0.1, 0.3, 0.5, 1.0\}$. We set $\mu$ so as to
achieve $E[Y_{ij}]$ close to 0.02 (see footnote 7).
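
As a notational sketch of the setup just described, a probit two-way random effects model with a single intercept can be written in latent-variable form. The condition superscript $(d)$, the indices $i$ (user) and $j$ (item), the residual $\varepsilon$, and the exact arrangement below are assumptions of this sketch:

```latex
\begin{align*}
y^{*(d)}_{ij} &= \mu + \alpha^{(d)}_{i} + \beta^{(d)}_{j} + \varepsilon^{(d)}_{ij},
&
Y^{(d)}_{ij} &= \mathbf{1}\{\, y^{*(d)}_{ij} > 0 \,\}, \\
\operatorname{Cov}\!\begin{pmatrix} \beta^{(0)}_{j} \\ \beta^{(1)}_{j} \end{pmatrix}
&= \sigma_\beta^{2}
\begin{pmatrix} 1 & \rho_\beta \\ \rho_\beta & 1 \end{pmatrix},
&
E\big[Y^{(d)}_{ij}\big] &= \Phi(\mu)
\quad \text{when } \operatorname{Var}\big(y^{*(d)}_{ij}\big) = 1 .
\end{align*}
```

If all components are normal and their variances are rescaled to sum to 1, as in footnote 7, then $\mu = -2$ gives $\Phi(-2) \approx 0.023$, in line with the target of about 0.02.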

We constructed the set of observed user-item pairs used 
in the simulations by assigning each of 3,000 potential users 
and 200 potential ads a log-normally distributed score. For
each of 2N observations, we selected a particular user and
ad with probability proportional to their scores. This yielded
a "layout" with 2481 unique users, 199 unique ads, and
duplication coefficients $\nu_a = 30.9$ and $\nu_b = 6077.4$, which is
similar to the Ads dataset.
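
A layout of this kind can be sketched as follows. This is an illustration only: it assumes users and ads are drawn independently of each other, it assumes the duplication coefficient of a unit type is (1/N) times the sum of squared per-unit observation counts, and the log-normal parameters and number of observations are hypothetical values rather than the ones used for the reported layout.

```python
import numpy as np

def duplication_coefficient(ids):
    """(1/N) * sum over units of (number of observations of that unit)^2 (assumed definition)."""
    _, counts = np.unique(ids, return_counts=True)
    return float(np.sum(counts.astype(float) ** 2) / len(ids))

def make_layout(n_obs, n_users=3000, n_items=200,
                sigma_user=1.0, sigma_item=1.0, seed=0):
    """Sample user and item ids with probability proportional to log-normal scores."""
    rng = np.random.default_rng(seed)
    user_score = rng.lognormal(mean=0.0, sigma=sigma_user, size=n_users)
    item_score = rng.lognormal(mean=0.0, sigma=sigma_item, size=n_items)
    users = rng.choice(n_users, size=n_obs, p=user_score / user_score.sum())
    items = rng.choice(n_items, size=n_obs, p=item_score / item_score.sum())
    return users, items

users, items = make_layout(n_obs=20_000)
print("unique users:", len(np.unique(users)), "unique items:", len(np.unique(items)))
print("user duplication:", duplication_coefficient(users))
print("item duplication:", duplication_coefficient(items))
```

Under this definition the coefficient equals 1 when every observation has a distinct unit and equals N when all observations share one unit, so a small pool of ads produces a much larger item coefficient than user coefficient.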



7 Since there is no scale to the latent variable $y^*_{ij}$, we achieved
this by in fact choosing a fixed $\mu = -2$ and rescaling the
random effect variances to sum to 1.

Figure 5: Effects of item-treatment interaction on true coverage of 95% confidence intervals.
Decreasing $\rho_\beta$, which makes the random item effects less correlated between treatment and control, reduces the
coverage of user bootstrap confidence intervals. This effect is moderated by the magnitude of the item-level
random effects.



4.1 Item-treatment interactions 

Even if the treatment has no effects on average, it can have 
positive effects for some users and items and negative effects 
for others. Given our random effects model, we know that 
item-treatment interactions can increase the contribution of 
duplication of items to the variance of the mean difference. 

We vary item-treatment interactions by setting the
correlation coefficient $\rho_\beta \in \{0, 0.25, 0.5, 0.75, 1\}$. Perfect
correlation, $\rho_\beta = 1$, corresponds to the sharp null hypothesis, while
decreasing $\rho_\beta$ corresponds to an increasing proportion of
item random effects not being shared across conditions. At
the extreme of $\rho_\beta = 0$, the random effect of an item in the
treatment is completely independent of its random effect in
the control.
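
One way to generate item effects with a chosen correlation between conditions, and binary outcomes from them, is sketched below. The function names, the unit residual variance before rescaling, the rescaling of the latent scale to unit total variance, and the toy layout are assumptions of this sketch; only the parameter values $\sigma_\alpha = 0.3$, $\sigma_\beta = 0.5$, $\rho_\beta = 0.75$ and $\mu = -2$ are taken from the text.

```python
import numpy as np

def item_effects(n_items, sigma_beta, rho_beta, rng):
    """Columns are control and treatment item effects with correlation rho_beta."""
    z0 = rng.standard_normal(n_items)
    z1 = rng.standard_normal(n_items)
    beta_control = sigma_beta * z0
    # mixing z0 with an independent z1 yields correlation rho_beta and variance sigma_beta^2
    beta_treat = sigma_beta * (rho_beta * z0 + np.sqrt(1.0 - rho_beta ** 2) * z1)
    return np.column_stack([beta_control, beta_treat])

def simulate(users, items, sigma_alpha=0.3, sigma_beta=0.5,
             rho_beta=0.75, mu=-2.0, seed=0):
    """Simulate binary outcomes with no average treatment effect."""
    rng = np.random.default_rng(seed)
    n_users, n_items = users.max() + 1, items.max() + 1
    d = rng.integers(0, 2, size=n_users)[users]            # user-level assignment
    alpha = rng.normal(0.0, sigma_alpha, size=n_users)[users]
    beta = item_effects(n_items, sigma_beta, rho_beta, rng)[items, d]
    eps = rng.standard_normal(len(users))
    # rescale so the latent variable has total variance 1 (assumed convention)
    scale = np.sqrt(sigma_alpha ** 2 + sigma_beta ** 2 + 1.0)
    latent = mu + (alpha + beta + eps) / scale
    return (latent > 0).astype(int), d

# Toy layout: 300 users with 10 observations each, spread over 50 items.
users = np.repeat(np.arange(300), 10)
items = np.tile(np.arange(50), 60)
y, d = simulate(users, items)
delta_hat = y[d == 1].mean() - y[d == 0].mean()
print("difference in means:", delta_hat)
```

With `rho_beta=1.0` the two columns returned by `item_effects` are identical, which corresponds to the sharp null; with `rho_beta=0.0` they are independent draws.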

4.2 Results 

Figure [5] summarizes the results of 1000 simulations
for each combination of parameter values. Without any
item-treatment interaction, both the user and item
bootstrap have approximately correct coverage; this is
attributable to the relatively low $\nu_a$ in this simulation, and
is consistent with the results from a small number of days
in our real datasets. As the item-treatment interaction
increases, the coverage of the user bootstrap confidence
intervals drops substantially. For example, even with moderate
values of $\sigma_\beta = 0.5$ and $\rho_\beta = 0.75$, a nominally 95%
confidence interval has a true coverage of 87.5%. While we do
not expect to observe the extreme of all item-level
variance being treatment-specific (i.e., $\rho_\beta = 0$), these results
demonstrate that deviations from the sharp null in the form
of item-treatment interaction have serious consequences for
the single-way bootstrap. On the other hand, the multiway
bootstrap remains mildly conservative even with large $\sigma_\beta$
and small $\rho_\beta$.
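
Because each coverage figure is a proportion over a finite number of simulations, it carries Monte Carlo error of its own. A minimal sketch of the calculation, assuming independent replicates and a binomial standard error (the function name is hypothetical):

```python
import numpy as np

def empirical_coverage(lower, upper, true_value=0.0):
    """Share of intervals containing true_value, with a binomial standard error."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    covered = (lower <= true_value) & (true_value <= upper)
    cov = float(covered.mean())
    se = float(np.sqrt(cov * (1.0 - cov) / covered.size))
    return cov, se

# 1000 replicates, of which 875 intervals cover the true difference of zero.
lower = np.concatenate([np.full(875, -1.0), np.full(125, 0.5)])
upper = np.concatenate([np.full(875, 1.0), np.full(125, 2.0)])
print(empirical_coverage(lower, upper))
```

For a coverage of 87.5% estimated from 1000 simulations, the standard error is sqrt(0.875 × 0.125 / 1000) ≈ 0.0105, or about one percentage point, so a gap of that size from the nominal 95% is well outside simulation noise.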

5. DISCUSSION 

Despite having a large number of individual observations, 
many settings for online experiments involve substantial
dependence and small effects such that statistical inference
remains a central concern. The preceding analysis of real and
simulated data makes clear that methods which neglect de- 
pendence structure in these large experiments can result in 
high Type I error rates and confidence intervals with poor 
coverage. In each of our three datasets, the iid bootstrap 
performed very poorly, such that using it (or other methods 
assuming iid observations) would result in reaching incor- 
rect conclusions about the presence, sign, and magnitude of 
treatment effects [11].

On the other hand, neglecting dependence among obser- 
vations of units not assigned to conditions (the items) gen- 
erally did not result in lower coverage with our data. For 
each of the datasets, this remained the case even when we 
produced imbalance of items across conditions. Given the 
random effects model posited in Section [2.1], one might
expect this imbalance to make it necessary to account separately
for both the user and item contributions to the variance.
Since bootstrapping multiple units and storing these repli- 
cates can have substantial costs in terms of computation and 
infrastructure, our results suggest that experimenters should 
consider whether a single-way bootstrap on the experimen- 
tal units may be practically sufficient, even in the presence 
of other clearly relevant units, such as ads and URLs. 

Nonetheless, neglecting dependence among observations of
these non-experimental units may have substantial effects 
on coverage when the treatment has any effects. Most treat- 
ments are expected to have some effects. Our simulations 
with item-treatment interaction effects demonstrate that 
the coverage of the user bootstrap can be extremely sen- 
sitive to the presence of these effects. This highlights that 
using A/A tests only serves to validate inferential proce- 
dures under a narrow set of conditions (i.e., the sharp null 
hypothesis), but cannot detect other (potentially severe) in- 
ferential problems that occur in other circumstances. Given 
that experimenters expect treatment effects, and often want 
to know how large the average effects are, they should
consider whether or not they wish to use a procedure that
provides a somewhat conservative measurement of uncertainty
(i.e., the multiway bootstrap), or the user-level bootstrap,
which correctly tests the less plausible sharp null.

A limitation of the present work is that, from the perspective
of experimenters such as ourselves trying to evaluate
inferential methods in practice, there remains a gap between
what is possible to learn from straightforward perturbations 
of real datasets and what is possible to learn from necessar- 
ily simplified generative models. Future work may develop 
more sophisticated ways of perturbing existing data and us- 
ing additional parameters estimated from real experiments 
to produce evaluations for data that more closely resemble 
outcomes in the field. 

This paper has been primarily concerned with Type I
error rates and the coverage of confidence intervals, but
experimenters are equally concerned about Type II errors (failures
to reject a false null) and related errors such as incorrectly
estimating the direction or magnitude of effects. Many
principled approaches to choosing how to assign units to one
of many available treatments over time (e.g. solutions to 
multi-armed bandit problems) require correctly estimating 
one's uncertainty about the expected payoffs of the treat- 
ments [23]. Therefore, we expect that addressing multiway 
dependence will remain important when taking these ap- 
proaches as well. A related point is that experimenters of- 
ten exert considerable effort reducing the width of CIs by 
increasing precision through design and adjustment Bl pi 
[16]. Many of these methods could be applied in
combination with single or multiway bootstrapping. Finally, there
may be other practical ways to reduce the width of multiway
bootstrap CIs through using linear combinations of variance
estimates from different bootstrap procedures [6, 19].